iOS MachineLearning Series (13): Speech and Audio AI Capabilities
For speech analysis, iOS provides the native Speech framework, which can transcribe speech into text in real time. This is a powerful capability; with it we can build features such as real-time transcription and translation. For non-speech audio there are also native AI capabilities, for example classifying the type of a sound. The SoundAnalysis framework can recognize more than 300 kinds of sounds, and we can also use our own trained models for custom audio recognition needs.
1 - Speech Recognition
Using the Speech framework for speech recognition is straightforward, and it supports many languages. Before using it, we need to request the user's authorization. Add the following key to the Info.plist file:
```
NSSpeechRecognitionUsageDescription
```
The value of this key is a string that will be shown to the user in the permission alert when the Speech framework is used.
Note that the Speech framework does not run entirely on device; it needs to connect to Apple's servers to do its work, so make sure the network is available when using it.
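Besides the Info.plist key, authorization also has to be requested at runtime. The article does not show this step; a minimal sketch might look like the following:

```swift
import Speech

// Ask the user for speech recognition permission before starting any requests.
SFSpeechRecognizer.requestAuthorization { status in
    switch status {
    case .authorized:
        print("Speech recognition authorized")
    case .denied, .restricted, .notDetermined:
        print("Speech recognition not available: \(status)")
    @unknown default:
        break
    }
}
```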
First, define a recognizer object:
```swift
// "zh-CN" selects Simplified Chinese (Mandarin).
let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "zh-CN"))
```
The locale parameter sets the language to be recognized.
Next, create a speech recognition request and start a recognition task:
```swift
let path = Bundle.main.path(forResource: "12168", ofType: "wav")
let url = URL(fileURLWithPath: path!)
let request = SFSpeechURLRecognitionRequest(url: url)

// A label to display the recognized text.
let label = UILabel(frame: CGRect(x: 0, y: 100, width: view.frame.width, height: 400))
label.numberOfLines = 0
view.addSubview(label)

recognizer?.recognitionTask(with: request, resultHandler: { result, error in
    print(result?.bestTranscription.formattedString ?? "", error?.localizedDescription ?? "")
    label.text = result?.bestTranscription.formattedString
})
```
Run the code above with a valid speech recording and you will see the recognition results. The result handler is called multiple times depending on the length of the audio, and each callback corrects the earlier results based on context.
The Speech framework supports not only recognizing audio files but also recognizing live audio streams in real time; you simply create a different request class. Let's first look at the base class of recognition requests, SFSpeechRecognitionRequest:
```swift
open class SFSpeechRecognitionRequest : NSObject {
    // The kind of recognition being performed (dictation, search, etc.)
    open var taskHint: SFSpeechRecognitionTaskHint
    // Whether intermediate (partial) results should be reported
    open var shouldReportPartialResults: Bool
    // Phrases that should be recognized even though they are uncommon
    open var contextualStrings: [String]
    // Whether recognition must be performed entirely on device
    open var requiresOnDeviceRecognition: Bool
    // Whether punctuation is added to the results automatically
    open var addsPunctuation: Bool
}
```
The taskHint property sets the recognition type; the enum is defined as follows:
```swift
public enum SFSpeechRecognitionTaskHint : Int, @unchecked Sendable {
    case unspecified = 0   // no specific hint
    case dictation = 1     // free-form dictation, e.g. a message
    case search = 2        // short search-style phrases
    case confirmation = 3  // short commands such as "yes" or "no"
}
```
To analyze an audio file, use the SFSpeechURLRecognitionRequest subclass; to recognize a live audio stream, use the SFSpeechAudioBufferRecognitionRequest subclass, which is defined as follows:
```swift
open class SFSpeechAudioBufferRecognitionRequest : SFSpeechRecognitionRequest {
    // The audio format the recognizer expects appended buffers to use
    open var nativeAudioFormat: AVAudioFormat { get }
    // Append PCM or sample buffers as the audio arrives
    open func append(_ audioPCMBuffer: AVAudioPCMBuffer)
    open func appendAudioSampleBuffer(_ sampleBuffer: CMSampleBuffer)
    // Call when no more audio will be appended
    open func endAudio()
}
```
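The article does not include a streaming example; as a rough sketch (the AVAudioEngine setup here is an assumption, not code from the original), live recognition from the microphone could look like this:

```swift
import Speech
import AVFoundation

// Assumes speech recognition and microphone permissions have been granted.
let audioEngine = AVAudioEngine()
let streamRequest = SFSpeechAudioBufferRecognitionRequest()
streamRequest.shouldReportPartialResults = true

// Feed microphone buffers into the request as they arrive.
let inputNode = audioEngine.inputNode
let format = inputNode.outputFormat(forBus: 0)
inputNode.installTap(onBus: 0, bufferSize: 1024, format: format) { buffer, _ in
    streamRequest.append(buffer)
}
audioEngine.prepare()
try! audioEngine.start()

let streamTask = recognizer?.recognitionTask(with: streamRequest) { result, error in
    if let result = result {
        print(result.bestTranscription.formattedString)
    }
}

// When finished:
// audioEngine.stop()
// inputNode.removeTap(onBus: 0)
// streamRequest.endAudio()
```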
The SFSpeechRecognizer class is used to start speech recognition requests; it is defined as follows:
```swift
open class SFSpeechRecognizer : NSObject {

    // Locales supported by the recognizer
    open class func supportedLocales() -> Set<Locale>

    // The user's authorization status
    open class func authorizationStatus() -> SFSpeechRecognizerAuthorizationStatus

    // Request the user's authorization
    open class func requestAuthorization(_ handler: @escaping (SFSpeechRecognizerAuthorizationStatus) -> Void)

    // Initializer that uses the current system language
    public convenience init?()

    // Initializer that specifies the language
    public init?(locale: Locale)

    // Whether the recognizer is currently available
    open var isAvailable: Bool { get }
    open var locale: Locale { get }
    open var supportsOnDeviceRecognition: Bool
    weak open var delegate: SFSpeechRecognizerDelegate?
    open var defaultTaskHint: SFSpeechRecognitionTaskHint

    // Start a recognition task, delivering results in a closure or to a delegate
    open func recognitionTask(with request: SFSpeechRecognitionRequest, resultHandler: @escaping (SFSpeechRecognitionResult?, Error?) -> Void) -> SFSpeechRecognitionTask
    open func recognitionTask(with request: SFSpeechRecognitionRequest, delegate: SFSpeechRecognitionTaskDelegate) -> SFSpeechRecognitionTask

    open var queue: OperationQueue
}

public protocol SFSpeechRecognizerDelegate : NSObjectProtocol {
    // Called when the recognizer's availability changes
    optional func speechRecognizer(_ speechRecognizer: SFSpeechRecognizer, availabilityDidChange available: Bool)
}
```
If you start a request with a closure, the results are delivered in that closure; if you use a delegate, they are delivered through delegate callbacks. The SFSpeechRecognitionTaskDelegate protocol is as follows:
```swift
public protocol SFSpeechRecognitionTaskDelegate : NSObjectProtocol {

    // Speech has been detected in the audio
    optional func speechRecognitionDidDetectSpeech(_ task: SFSpeechRecognitionTask)

    // An intermediate transcription has been produced
    optional func speechRecognitionTask(_ task: SFSpeechRecognitionTask, didHypothesizeTranscription transcription: SFTranscription)

    // A final recognition result has been produced
    optional func speechRecognitionTask(_ task: SFSpeechRecognitionTask, didFinishRecognition recognitionResult: SFSpeechRecognitionResult)

    // The task has finished reading the audio input
    optional func speechRecognitionTaskFinishedReadingAudio(_ task: SFSpeechRecognitionTask)

    // The task was cancelled
    optional func speechRecognitionTaskWasCancelled(_ task: SFSpeechRecognitionTask)

    // The task finished, successfully or not
    optional func speechRecognitionTask(_ task: SFSpeechRecognitionTask, didFinishSuccessfully successfully: Bool)
}
```
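For illustration (the class name TranscriptionObserver below is made up for this sketch), a delegate-based task might be handled like this:

```swift
class TranscriptionObserver: NSObject, SFSpeechRecognitionTaskDelegate {

    // Intermediate transcriptions, refined as more audio is processed.
    func speechRecognitionTask(_ task: SFSpeechRecognitionTask,
                               didHypothesizeTranscription transcription: SFTranscription) {
        print("partial:", transcription.formattedString)
    }

    // The final result for this task.
    func speechRecognitionTask(_ task: SFSpeechRecognitionTask,
                               didFinishRecognition recognitionResult: SFSpeechRecognitionResult) {
        print("final:", recognitionResult.bestTranscription.formattedString)
    }

    func speechRecognitionTask(_ task: SFSpeechRecognitionTask,
                               didFinishSuccessfully successfully: Bool) {
        print("task finished, success =", successfully)
    }
}

// Usage:
// let task = recognizer?.recognitionTask(with: request, delegate: TranscriptionObserver())
```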
The SFTranscription class describes an intermediate recognition result:
```swift
open class SFTranscription : NSObject, NSCopying, NSSecureCoding {
    // The full transcribed text
    open var formattedString: String { get }
    // The individual segments (words or phrases) that make up the text
    open var segments: [SFTranscriptionSegment] { get }
    // Words spoken per minute
    open var speakingRate: Double { get }
    // Average pause between words
    open var averagePauseDuration: TimeInterval { get }
}
```
SFTranscriptionSegment represents an individual word or phrase within the transcription and carries more detailed information:
```swift
open class SFTranscriptionSegment : NSObject, NSCopying, NSSecureCoding {
    // The text of this segment
    open var substring: String { get }
    // The segment's range within the full formatted string
    open var substringRange: NSRange { get }
    // When the segment starts in the audio, and how long it lasts
    open var timestamp: TimeInterval { get }
    open var duration: TimeInterval { get }
    // Confidence of the recognition, from 0 to 1
    open var confidence: Float { get }
    // Alternative interpretations of this segment
    open var alternativeSubstrings: [String] { get }
    // Voice quality metrics for this segment
    open var voiceAnalytics: SFVoiceAnalytics? { get }
}
```
The SFSpeechRecognitionResult class describes the result of the analysis; it is essentially an aggregation of SFTranscription objects. It is defined as follows:
```swift
open class SFSpeechRecognitionResult : NSObject, NSCopying, NSSecureCoding {
    // The transcription with the highest confidence
    @NSCopying open var bestTranscription: SFTranscription { get }
    // All candidate transcriptions, ordered by confidence
    open var transcriptions: [SFTranscription] { get }
    // Whether this is the final result for the request
    open var isFinal: Bool { get }
    // Metadata about the recognized speech
    open var speechRecognitionMetadata: SFSpeechRecognitionMetadata? { get }
}
```
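As an example (not from the original article; `result` is assumed to be the value delivered to a recognition callback), the detailed segment information can be read like this:

```swift
if let result = result, result.isFinal {
    for segment in result.bestTranscription.segments {
        print(segment.substring,
              "start:", segment.timestamp,
              "duration:", segment.duration,
              "confidence:", segment.confidence)
    }
}
```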
SFSpeechRecognitionMetadata encapsulates basic measurements of the recognized speech:
```swift
open class SFSpeechRecognitionMetadata : NSObject, NSCopying, NSSecureCoding {
    // Words spoken per minute
    open var speakingRate: Double { get }
    // Average pause between words
    open var averagePauseDuration: TimeInterval { get }
    // When the speech starts in the audio, and how long it lasts
    open var speechStartTimestamp: TimeInterval { get }
    open var speechDuration: TimeInterval { get }
    // Voice quality metrics
    open var voiceAnalytics: SFVoiceAnalytics? { get }
}

open class SFVoiceAnalytics : NSObject, NSCopying, NSSecureCoding {
    // Pitch variation (jitter) and amplitude variation (shimmer)
    @NSCopying open var jitter: SFAcousticFeature { get }
    @NSCopying open var shimmer: SFAcousticFeature { get }
    // Pitch and degree of voicing
    @NSCopying open var pitch: SFAcousticFeature { get }
    @NSCopying open var voicing: SFAcousticFeature { get }
}
```
Starting a recognition request returns an SFSpeechRecognitionTask object, which can be used to control the ongoing recognition:
```swift
open class SFSpeechRecognitionTask : NSObject {
    // Current state of the task
    open var state: SFSpeechRecognitionTaskState { get }
    // Whether the task has stopped accepting new audio
    open var isFinishing: Bool { get }
    // Stop accepting audio and finish processing what has been received
    open func finish()
    // Whether the task has been cancelled
    open var isCancelled: Bool { get }
    // Abandon the task immediately
    open func cancel()
    // The error that occurred, if any
    open var error: Error? { get }
}

public enum SFSpeechRecognitionTaskState : Int, @unchecked Sendable {
    case starting = 0
    case running = 1
    case finishing = 2
    case canceling = 3
    case completed = 4
}
```
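For instance (a small sketch; `streamRequest` is the audio-buffer request from the streaming example above), a long-running task can be wound down or abandoned:

```swift
let task = recognizer?.recognitionTask(with: streamRequest) { result, _ in
    if let result = result {
        print(result.bestTranscription.formattedString, result.isFinal)
    }
}

// Stop accepting new audio but still deliver results for what was heard:
task?.finish()

// Or drop the task entirely:
// task?.cancel()
```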
2 - Sound Classification
The audio analysis capability built into iOS makes it easy to classify sounds, for example human voices or musical instruments. The built-in SoundAnalysis framework can recognize more than 300 kinds of sounds, and it also supports custom models for analysis; this article only covers the built-in API.
The SNAudioFileAnalyzer class performs the analysis of audio files, for example:
```swift
let path = Bundle.main.path(forResource: "12168", ofType: "wav")!
let analyzer = try! SNAudioFileAnalyzer(url: URL(fileURLWithPath: path))
```
An SNAudioFileAnalyzer wraps the URL of the audio file it will analyze. Next, create a classification request:
```swift
let request = try! SNClassifySoundRequest(classifierIdentifier: .version1)
```
The parameter selects the version of the built-in classifier to use.
Add the request to the SNAudioFileAnalyzer instance, together with an observer for the analysis, using the following method:
```swift
try! analyzer.add(request, withObserver: self)
```
Then call the analyze method to start the analysis:
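For example, using the synchronous variant (completion-handler and async variants are listed later in this section):

```swift
analyzer.analyze()
```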
The observer object must conform to the SNResultsObserving protocol:
```swift
public protocol SNResultsObserving : NSObjectProtocol {
    // Called each time the request produces a result
    func request(_ request: SNRequest, didProduce result: SNResult)
    // Called if the analysis fails
    optional func request(_ request: SNRequest, didFailWithError error: Error)
    // Called when the request has processed all of the audio
    optional func requestDidComplete(_ request: SNRequest)
}
```
The result is delivered as an SNResult, which is only a base protocol; the object actually returned is an SNClassificationResult, defined as follows:
```swift
open class SNClassificationResult : NSObject, SNResult {
    // All classifications for this window of audio, ordered by confidence
    open var classifications: [SNClassification] { get }
    // The time range of audio this result covers
    open var timeRange: CMTimeRange { get }
    // Look up the classification for a specific sound identifier
    open func classification(forIdentifier identifier: String) -> SNClassification?
}

open class SNClassification : NSObject {
    // The identifier of the recognized sound, e.g. "speech" or "guitar"
    open var identifier: String { get }
    // Confidence between 0 and 1
    open var confidence: Double { get }
}
```
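Putting these pieces together, a minimal observer might look like the following sketch (the class name SoundObserver is hypothetical; in the earlier snippets `self` played this role):

```swift
class SoundObserver: NSObject, SNResultsObserving {

    // Print the top classification of each analyzed window.
    func request(_ request: SNRequest, didProduce result: SNResult) {
        guard let result = result as? SNClassificationResult,
              let best = result.classifications.first else { return }
        print("\(best.identifier) (\(best.confidence)) in \(result.timeRange)")
    }

    func request(_ request: SNRequest, didFailWithError error: Error) {
        print("analysis failed:", error)
    }

    func requestDidComplete(_ request: SNRequest) {
        print("analysis complete")
    }
}
```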
The SoundAnalysis framework itself is fairly simple. Let's look at the request classes: SNRequest is the base class of all requests, kept abstract to leave room for future additions to the framework; requests are created with the SNClassifySoundRequest class:
```swift
open class SNClassifySoundRequest : NSObject, SNRequest {
    // How much consecutive analysis windows overlap (0...1)
    open var overlapFactor: Double
    // The duration of each analysis window
    open var windowDuration: CMTime
    // All sound identifiers this classifier can recognize
    open var knownClassifications: [String] { get }
    // Create a request from a custom Core ML model
    public init(mlModel: MLModel) throws
    // Create a request that uses the built-in classifier
    public init(classifierIdentifier: SNClassifierIdentifier) throws
}
```
Finally, the analyzer classes manage the requests and determine which audio is analyzed. SoundAnalysis supports analyzing audio files directly as well as analyzing audio streams, using the following classes:
```swift
open class SNAudioFileAnalyzer : NSObject {
    // Create an analyzer for the audio file at the given URL
    public init(url: URL) throws
    // Add or remove requests and their observers
    open func add(_ request: SNRequest, withObserver observer: SNResultsObserving) throws
    open func remove(_ request: SNRequest)
    open func removeAllRequests()
    // Run the analysis: synchronously, with a completion handler, or asynchronously
    open func analyze()
    open func analyze(completionHandler: @escaping (Bool) -> Void)
    open func analyze() async -> Bool
    // Cancel an analysis that is in progress
    open func cancelAnalysis()
}

open class SNAudioStreamAnalyzer : NSObject {
    // Create an analyzer for a stream of audio in the given format
    public init(format: AVAudioFormat)
    open func add(_ request: SNRequest, withObserver observer: SNResultsObserving) throws
    open func remove(_ request: SNRequest)
    open func removeAllRequests()
    // Analyze one buffer of the stream at the given frame position
    open func analyze(_ audioBuffer: AVAudioBuffer, atAudioFramePosition audioFramePosition: AVAudioFramePosition)
    // Tell the analyzer that the stream has ended
    open func completeAnalysis()
}
```
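The article only exercises the file-based analyzer; as a rough sketch (the AVAudioEngine tap below is an assumption, mirroring the speech streaming example), SNAudioStreamAnalyzer could be driven from the microphone like this:

```swift
import AVFoundation
import SoundAnalysis

let engine = AVAudioEngine()
let input = engine.inputNode
let format = input.outputFormat(forBus: 0)

let streamAnalyzer = SNAudioStreamAnalyzer(format: format)
let observer = SoundObserver()  // the hypothetical observer sketched above
try! streamAnalyzer.add(try! SNClassifySoundRequest(classifierIdentifier: .version1),
                        withObserver: observer)

// Feed microphone buffers into the analyzer as they arrive.
input.installTap(onBus: 0, bufferSize: 8192, format: format) { buffer, when in
    streamAnalyzer.analyze(buffer, atAudioFramePosition: when.sampleTime)
}
engine.prepare()
try! engine.start()
```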
The complete sample code can be found at:
https://github.com/ZYHshao/MachineLearnDemo